Churn rate, when applied to a customer base, refers to the proportion of contractual customers or subscribers who leave a supplier during a given time period. It is a possible indicator of customer dissatisfaction, cheaper and/or better offers from the competition, more successful sales and/or marketing by the competition, or reasons having to do with the customer life cycle.
Churn is closely related to the concept of average customer lifetime: under a constant churn rate, the average lifetime is the reciprocal of the rate. For example, an annual churn rate of 25 percent implies an average customer life of four years, and an annual churn rate of 33 percent implies an average customer life of about three years. The churn rate can be reduced by creating barriers that discourage customers from changing suppliers (contractual binding periods, use of proprietary technology, value-added services, unique business models, etc.), or through retention activities such as loyalty programs.
It is possible to overstate the churn rate, as when a consumer drops the service but then restarts it within the same year. Thus, a clear distinction needs to be made between "gross churn", the total number of absolute disconnections, and "net churn", the overall loss of subscribers or members. The difference between the two measures is the number of new subscribers or members that have joined during the same period. Suppliers may find that offering a loss-leader "introductory special" leads to a higher churn rate and subscriber abuse, as some subscribers will sign on, let the service lapse, then sign on again to take continuous advantage of current specials. https://en.wikipedia.org/wiki/Churn_rate
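The arithmetic behind these figures can be sketched in a few lines. This is a minimal illustration, assuming a constant per-period churn rate; the function names and the subscriber counts are hypothetical and are not taken from the dataset used below.

```python
def average_customer_lifetime(churn_rate):
    """Expected customer lifetime in periods, assuming a constant churn rate."""
    if not 0 < churn_rate <= 1:
        raise ValueError("churn_rate must be in (0, 1]")
    return 1.0 / churn_rate


def net_churn(gross_disconnections, new_subscribers):
    """Net subscriber loss: gross disconnections minus joiners in the same period."""
    return gross_disconnections - new_subscribers


# 25% annual churn gives four years; 33% gives roughly three
print(average_customer_lifetime(0.25))
print(round(average_customer_lifetime(0.33), 2))

# Hypothetical counts: 120 disconnections offset by 90 new subscribers
print(net_churn(gross_disconnections=120, new_subscribers=90))
```

A negative net churn in this sketch would mean the subscriber base grew over the period even though some customers left.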
In [2]:
%%capture
# Get our favorite packages from PyPI
! pip install plotly
! pip install cufflinks
! pip install h2o
# Import pre-installed packages
import numpy as np
import pandas as pd
# Suppress unwanted warnings
import warnings
warnings.filterwarnings('ignore')
import logging
logging.getLogger("requests").setLevel(logging.WARNING)
In [3]:
# Load our favorite visualization library
import plotly
import plotly.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
import cufflinks as cf
plotly.offline.init_notebook_mode(connected=True)
# Sign into Plotly with an API key read from the environment (never hard-code credentials in a notebook)
import os
myPlotlyKey = os.environ['SECRET_ENV_BRETTS_PLOTLY_KEY']
py.sign_in(username='bretto777', api_key=myPlotlyKey)
In [4]:
# Load some data
churnDF = pd.read_csv('https://s3-us-west-1.amazonaws.com/dsclouddata/home/jupyter/churn_train.csv', delimiter=',')
churnDF["Churn"] = churnDF["Churn"].replace(to_replace=False, value='Retain')
churnDF["Churn"] = churnDF["Churn"].replace(to_replace=True, value='Churn')
churnDFs = churnDF.sample(frac=0.07) # Sample for speedy viz
churnDF.head(10)
Out[4]:
In [5]:
# separate the calls data for plotting
churnDFs = churnDFs[['Account Length','Day Calls','Eve Calls','CustServ Calls','Churn']]
# Create scatter plot matrix of call data
splom = ff.create_scatterplotmatrix(churnDFs, diag='histogram', index='Churn',
                                    colormap=dict(
                                        Churn='#9CBEF1',
                                        Retain='#04367F'
                                    ),
                                    colormap_type='cat',
                                    height=560, width=650,
                                    size=4, marker=dict(symbol='circle'))
py.iplot(splom)
Out[5]:
In [6]:
%%capture
# __future__ imports must be the first statement in the cell
from __future__ import print_function
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()
h2o.remove_all()
In [7]:
# Split data into training and testing frames
from sklearn.model_selection import train_test_split
training, testing = train_test_split(churnDF, train_size=0.8, stratify=churnDF["Churn"], random_state=9)
train = h2o.H2OFrame(python_obj=training).drop("State")
test = h2o.H2OFrame(python_obj=testing).drop("State")
# Set predictor and response variables
y = "Churn"
x = train.columns
x.remove(y)
The super learner is a prediction method designed to find the optimal combination of a collection of prediction algorithms. The super learner algorithm finds the combination of algorithms that minimizes the cross-validated risk. The super learner framework is built on the theory of cross-validation and allows a general class of prediction algorithms to be considered for the ensemble. http://biostats.bepress.com/ucbbiostat/paper266/ (Polley & Van der Laan, 2010)
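To make the idea concrete, here is a toy sketch in plain Python. It is not H2O's implementation and not the authors' algorithm in full: it assumes two deliberately simple base learners (a mean predictor and a one-variable least-squares line), a made-up regression dataset, modulo fold assignment, and a meta-learner restricted to a single blending weight chosen on a grid. All function names are hypothetical.

```python
def fit_mean(xs, ys):
    """Base learner 1: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Base learner 2: one-variable least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def cv_predictions(fit, xs, ys, nfolds):
    """Held-out predictions: row i is predicted by a model that never saw row i."""
    preds = [None] * len(xs)
    for fold in range(nfolds):
        train_idx = [i for i in range(len(xs)) if i % nfolds != fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in range(fold, len(xs), nfolds):
            preds[i] = model(xs[i])
    return preds


def mse(preds, ys):
    """Squared-error risk."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def blend(weight, preds_a, preds_b):
    """Convex combination of two prediction vectors."""
    return [weight * a + (1 - weight) * b for a, b in zip(preds_a, preds_b)]


# Made-up toy data, for illustration only
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0.0, 1.2, 1.9, 3.2, 3.8, 5.1, 5.9, 7.3, 7.8, 9.1]
nfolds = 5

z_mean = cv_predictions(fit_mean, xs, ys, nfolds)
z_line = cv_predictions(fit_line, xs, ys, nfolds)

# Meta-learning step: pick the blending weight with the lowest cross-validated risk
candidates = [w / 100.0 for w in range(101)]
best_weight = min(candidates, key=lambda w: mse(blend(w, z_mean, z_line), ys))
ensemble_risk = mse(blend(best_weight, z_mean, z_line), ys)

print("CV risk, mean learner:", mse(z_mean, ys))
print("CV risk, line learner:", mse(z_line, ys))
print("CV risk, blended:", ensemble_risk, "with weight", best_weight)
```

Because the candidate weights include 0 and 1, the blended cross-validated risk can never be worse than that of either base learner alone, which is the property the super learner exploits. The ensemble trained below follows the same pattern at larger scale: the base models keep their cross-validation predictions so that a meta-learner can be fit on them.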
In [8]:
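# Reset variables from a previous run (names that are not defined yet are skipped)
for name in ['allModels', 'gridGBM', 'gridRF', 'grids', 'dfGridGBM', 'dfGridRF', 'ensemble']:
    globals().pop(name, None)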
In [ ]:
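# Reset the ROC variables from a previous run (names that are not defined yet are skipped)
for name in ['SuperModel', 'BestModel', 'Model2', 'Model3', 'Model4', 'Model5', 'Model6', 'Model7', 'Model8', 'Model9', 'Model10']:
    globals().pop(name, None)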
In [9]:
%%time
# Specify GBM hyperparameters and stopping criteria for the grid
nfolds = 5
gbm_hyper_params = {"learn_rate":[0.075, 0.1], "nbins":[10,15,20],"ntrees": [20,30,40], "max_depth": [5,7,9], "sample_rate": [0.75, 0.8, 0.85, 0.9]}
search_criteria = {"strategy": "RandomDiscrete", "max_models": 6}
# Setup the GBM grid search
gridGBM = H2OGridSearch(model=H2OGradientBoostingEstimator(balance_classes=True, seed=123, nfolds=nfolds, fold_assignment="Modulo", keep_cross_validation_predictions=True),
                        hyper_params=gbm_hyper_params,
                        search_criteria=search_criteria,
                        grid_id="gbm_grid_binomial")
# Start the GBM training
gridGBM.train(x=x, y=y, training_frame=train)
# Setup the Random Forest grid search
rf_hyper_params = {"mtries":[12,15,18],"nbins":[10,20,30], "ntrees": [25,50,75], "max_depth": [5,7], "sample_rate": [0.75, 0.8, 0.85, 0.9]}
gridRF = H2OGridSearch(model=H2ORandomForestEstimator(balance_classes=True, seed=123, nfolds=nfolds, fold_assignment="Modulo", keep_cross_validation_predictions=True),
                       hyper_params=rf_hyper_params,
                       search_criteria=search_criteria,
                       grid_id="rf_grid_binomial")
# Start the Random Forest training
gridRF.train(x=x, y=y, training_frame=train)
# List the GBMs and Random Forests that we wish to ensemble
grids = gridGBM.model_ids + gridRF.model_ids
# Train the super learner
ensemble = H2OStackedEnsembleEstimator(model_id="GBM_RF_ensemble_", base_models=grids, training_frame=train, validation_frame=test)
ensemble.train(x=x, y=y, training_frame=train)
# Evaluate ensemble performance on the test data
perf_stack_test = ensemble.model_performance(test)
# Compare the super learner to the base learners. First, combine all the base models into a single list, sorted by auc.
dfGridGBM = pd.DataFrame(data=gridGBM.get_grid(sort_by="auc", decreasing=True).sorted_metric_table())
dfGridRF = pd.DataFrame(data=gridRF.get_grid(sort_by="auc", decreasing=True).sorted_metric_table())
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
allModels = pd.concat([dfGridGBM, dfGridRF])
allModels['auc'] = allModels['auc'].astype('float64')
allModels.sort_values(by="auc", ascending=False, inplace=True)
allModels = allModels.reset_index()
baselearner_best_name = allModels.loc[0,'model_ids']
baselearner_best_auc = allModels.loc[0,'auc']
# Best stacked model auc
stack_auc_test = perf_stack_test.auc()
# The grid table reports cross-validated AUC (nfolds > 0), not AUC on the test frame
print("Best Base-learner CV AUC: " + str(baselearner_best_auc))
print("Ensemble Test AUC: " + str(stack_auc_test))
In [10]:
allModels
Out[10]:
In [11]:
best = h2o.get_model(baselearner_best_name)
importances = best.varimp(use_pandas=True)
importances = importances.loc[:,['variable','relative_importance']].groupby('variable').mean()
importances.sort_values(by="relative_importance", ascending=False).iplot(kind='bar', colors='#5AC4F2', theme='white')
Out[11]:
In [12]:
SuperModel = np.array(ensemble.roc(valid=True))
BestModel = np.array(h2o.get_model(baselearner_best_name).roc(xval=True))
Model2 = np.array(h2o.get_model(allModels.loc[1,'model_ids']).roc(xval=True))
Model3 = np.array(h2o.get_model(allModels.loc[2,'model_ids']).roc(xval=True))
Model4 = np.array(h2o.get_model(allModels.loc[3,'model_ids']).roc(xval=True))
Model5 = np.array(h2o.get_model(allModels.loc[4,'model_ids']).roc(xval=True))
Model6 = np.array(h2o.get_model(allModels.loc[5,'model_ids']).roc(xval=True))
Model7 = np.array(h2o.get_model(allModels.loc[6,'model_ids']).roc(xval=True))
Model8 = np.array(h2o.get_model(allModels.loc[7,'model_ids']).roc(xval=True))
Model9 = np.array(h2o.get_model(allModels.loc[8,'model_ids']).roc(xval=True))
Model10 = np.array(h2o.get_model(allModels.loc[9,'model_ids']).roc(xval=True))
layout = go.Layout(autosize=False, width=725, height=575,
                   xaxis=dict(title='False Positive Rate', titlefont=dict(family='Arial, sans-serif', size=15, color='grey')),
                   yaxis=dict(title='True Positive Rate', titlefont=dict(family='Arial, sans-serif', size=15, color='grey')))
SuperModelTrace = go.Scatter(x = SuperModel[0],y = SuperModel[1], mode = 'lines', name = 'Super Model', line = dict(color = ('rgb(26, 58, 126)'), width = 3))
BestModelTrace = go.Scatter(x = BestModel[0],y = BestModel[1], mode = 'lines', name = 'Best Base Model', line = dict(color = ('rgb(135, 160, 216)'), width = 3))
Model2Trace = go.Scatter(x = Model2[0], y = Model2[1], mode = 'lines', name = 'Model 2', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model3Trace = go.Scatter(x = Model3[0], y = Model3[1], mode = 'lines', name = 'Model 3', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model4Trace = go.Scatter(x = Model4[0], y = Model4[1], mode = 'lines', name = 'Model 4', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model5Trace = go.Scatter(x = Model5[0], y = Model5[1], mode = 'lines', name = 'Model 5', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model6Trace = go.Scatter(x = Model6[0], y = Model6[1], mode = 'lines', name = 'Model 6', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model7Trace = go.Scatter(x = Model7[0], y = Model7[1], mode = 'lines', name = 'Model 7', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model8Trace = go.Scatter(x = Model8[0], y = Model8[1], mode = 'lines', name = 'Model 8', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model9Trace = go.Scatter(x = Model9[0], y = Model9[1], mode = 'lines', name = 'Model 9', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model10Trace = go.Scatter(x = Model10[0], y = Model10[1], mode = 'lines', name = 'Model 10', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
traceChanceLine = go.Scatter(x = [0,1], y = [0,1], mode = 'lines+markers', name = 'chance', line = dict(color = ('rgb(136, 140, 150)'), width = 4, dash = 'dash'))
fig = go.Figure(data=[SuperModelTrace,BestModelTrace,Model2Trace,Model3Trace,Model4Trace,Model5Trace,Model6Trace,Model7Trace,Model8Trace,Model9Trace,Model10Trace,traceChanceLine], layout=layout)
py.iplot(fig)
Out[12]:
In [13]:
cm = perf_stack_test.confusion_matrix()
cm = cm.table.as_data_frame()
cm
confusionMatrix = ff.create_table(cm)
confusionMatrix.layout.height=300
confusionMatrix.layout.width=800
confusionMatrix.layout.font.size=17
py.iplot(confusionMatrix)
Out[13]:
In [22]:
cm.loc[0,'Churn'] * 75
Out[22]:
In [23]:
CorrectPredictChurn = cm.loc[0,'Churn']
CorrectPredictChurnImpact = 75
cm1 = CorrectPredictChurn*CorrectPredictChurnImpact
IncorrectPredictChurn = cm.loc[1,'Churn']
IncorrectPredictChurnImpact = -5
cm2 = IncorrectPredictChurn*IncorrectPredictChurnImpact
IncorrectPredictRetain = cm.loc[0,'Retain']
IncorrectPredictRetainImpact = -150
cm3 = IncorrectPredictRetain*IncorrectPredictRetainImpact
# Row 1 is actual Retain, so this cell holds the correctly predicted retentions
CorrectPredictRetain = cm.loc[1,'Retain']
CorrectPredictRetainImpact = 5
cm4 = CorrectPredictRetain*CorrectPredictRetainImpact
data_matrix = [['Business Impact', '($) Predicted Churn', '($) Predicted Retain', '($) Total'],
               ['($) Actual Churn', cm1, cm3, ''],
               ['($) Actual Retain', cm2, cm4, ''],
               ['($) Total', cm1+cm2, cm3+cm4, cm1+cm2+cm3+cm4]]
impactMatrix = ff.create_table(data_matrix, height_constant=20, hoverinfo='weight')
impactMatrix.layout.height=300
impactMatrix.layout.width=800
impactMatrix.layout.font.size=17
py.iplot(impactMatrix)
Out[23]:
In [15]:
print("Best learner AUC: " + str(baselearner_best_auc))
In [16]:
print("Super Model AUC: " + str(stack_auc_test))
In [17]:
print("Total customers evaluated: " + str(len(testing)))
In [18]:
print("Total value created by the model: $" + str(cm1+cm2+cm3+cm4))
In [19]:
print("Total value per customer: $" + str(round((cm1+cm2+cm3+cm4)/len(testing), 3)))
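The payoff arithmetic used in the impact matrix can also be written as a standalone helper. This is a sketch only: the per-customer payoffs (75, -150, -5, 5) are the ones assumed in this notebook, while the function name and the confusion-matrix counts in the usage lines are hypothetical and do not come from the model above.

```python
def business_impact(tp, fn, fp, tn,
                    payoff_tp=75, payoff_fn=-150, payoff_fp=-5, payoff_tn=5):
    """Dollar value of each confusion-matrix cell and their total.

    tp: actual churn, predicted churn      fn: actual churn, predicted retain
    fp: actual retain, predicted churn     tn: actual retain, predicted retain
    """
    cells = {
        "actual_churn_predicted_churn": tp * payoff_tp,
        "actual_churn_predicted_retain": fn * payoff_fn,
        "actual_retain_predicted_churn": fp * payoff_fp,
        "actual_retain_predicted_retain": tn * payoff_tn,
    }
    return cells, sum(cells.values())


# Hypothetical counts, for illustration only
tp, fn, fp, tn = 60, 17, 12, 445
cells, total = business_impact(tp, fn, fp, tn)
customers = tp + fn + fp + tn

print("Total value created: $" + str(total))
print("Value per customer: $" + str(round(total / float(customers), 3)))
```

Keeping the payoffs as keyword arguments makes it easy to test how sensitive the total is to the assumed cost of a missed churner, which dominates the other three cells.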
In [ ]:
# Save the model and upload it to s3
import os
from boto.s3.connection import S3Connection
from boto.s3.key import Key
In [ ]:
def upload_file_to_s3(myFile):
    def get_bucket():
        access = os.environ['SECRET_ENV_AWS_ACCESS_KEY_BRETT']
        secret = os.environ['SECRET_ENV_AWS_SECRET_KEY_BRETT']
        customer = 'demonstration'
        conn = S3Connection(access, secret)
        b = conn.get_bucket('dsclouddata', validate=False)
        return b
    s3_bucket = get_bucket()
    k = Key(s3_bucket)
    k.key = myFile
    k.set_contents_from_filename(myFile)
    k.make_public()
    successMessage = "Uploaded %s to S3." % (myFile)
    return successMessage
In [ ]:
myFile = h2o.save_model(ensemble, force=True)
# Upload the file
upload_file_to_s3(myFile)
In [ ]:
def pull_file_from_s3(key):
    def get_bucket():
        access = os.environ['SECRET_ENV_AWS_ACCESS_KEY_BRETT']
        secret = os.environ['SECRET_ENV_AWS_SECRET_KEY_BRETT']
        customer = 'demonstration'
        conn = S3Connection(access, secret)
        b = conn.get_bucket('dsclouddata', validate=False)
        return b
    s3_bucket = get_bucket()
    payload = s3_bucket.get_key(key)
    local_file = payload.get_contents_to_filename(key)
    return key
In [ ]:
# download the model from s3
downloaded_model = pull_file_from_s3('/home/jupyter/GBM_RF_ensemble_')
# load the downloaded model into memory
ChurnPredictor = h2o.load_model(path=downloaded_model)
In [ ]:
def churn_predict(batchFile):
    # connect to the model scoring service
    h2o.init()
    # load the user-specified file
    newData = h2o.import_file(batchFile)
    # open the downloaded model
    ChurnPredictor = h2o.load_model(path=downloaded_model)
    # evaluate the feature vector using the model
    predictions = ChurnPredictor.predict(newData)
    predictions = newData.cbind(predictions)
    h2o.download_csv(predictions, 'predictions.csv')
    upload_file_to_s3('/home/jupyter/predictions.csv')
    return "Saved predictions.csv to your S3 bucket."
In [ ]:
churn_predict(batchFile = 'https://s3-us-west-1.amazonaws.com/dsclouddata/home/jupyter/churn_test.csv')
In [ ]:
def churn_predict(State,AccountLength,AreaCode,Phone,IntlPlan,VMailPlan,VMailMessage,DayMins,DayCalls,DayCharge,EveMins,EveCalls,EveCharge,NightMins,NightCalls,NightCharge,IntlMins,IntlCalls,IntlCharge,CustServCalls):
    # define a feature vector to evaluate with the model
    newData = pd.DataFrame({'State' : State,
                            'Account Length' : AccountLength,
                            'Area Code' : AreaCode,
                            'Phone' : Phone,
                            'Int\'l Plan' : IntlPlan,
                            'VMail Plan' : VMailPlan,
                            'VMail Message' : VMailMessage,
                            'Day Mins' : DayMins,
                            'Day Calls' : DayCalls,
                            'Day Charge' : DayCharge,
                            'Eve Mins' : EveMins,
                            'Eve Calls' : EveCalls,
                            'Eve Charge' : EveCharge,
                            'Night Mins' : NightMins,
                            'Night Calls' : NightCalls,
                            'Night Charge' : NightCharge,
                            'Intl Mins' : IntlMins,
                            'Intl Calls' : IntlCalls,
                            'Intl Charge' : IntlCharge,
                            'CustServ Calls' : CustServCalls}, index=[0])
    # evaluate the feature vector using the model
    predictions = ChurnPredictor.predict(h2o.H2OFrame(newData))
    predictionsOut = h2o.as_list(predictions, use_pandas=False)
    return predictionsOut
In [ ]:
State = "KS"
AccountLength = 1
AreaCode = 213
Phone = "362-1234"
IntlPlan = "no"
VMailPlan = "no"
VMailMessage = 0
DayMins = 0
DayCalls = 2
DayCharge = 20
EveMins = 120
EveCalls = 97
EveCharge = 7
NightMins = 2
NightCalls = 7
NightCharge = 10
IntlMins = 13
IntlCalls = 0
IntlCharge = 3.67
CustServCalls = 2
churn_predict(State,AccountLength,AreaCode,Phone,IntlPlan,VMailPlan,VMailMessage,DayMins,DayCalls,DayCharge,EveMins,EveCalls,EveCharge,NightMins,NightCalls,NightCharge,IntlMins,IntlCalls,IntlCharge,CustServCalls)